Optimized text for full unicode and some escape sequences (#129169) by MitchLewis930 · Pull Request #7 · Signal65/elasticsearch-Copilot

MitchLewis930 · 2026-01-31T00:38:14Z

PR_017

…9169) Follow-up to elastic#126492 to apply the json parsing optimization to strings containing unicode characters and some backslash-escaped characters. Supporting backslash-escaped strings is tricky as it requires modifying the string. There are two types of modification: some just remove the backslash (e.g. \", \\), and some replace the whole escape sequence with a new character (e.g. \n, \r, \u00e5). In this implementation, the optimization only supports the first case--removing the backslash. This is done by making a copy of the data, skipping the backslash. It should still be more optimized than full String decoding, but it won't be as fast as non-backslashed strings where we can directly reference the input bytes. Relates to elastic#129072.

Copilot

Pull request overview

This PR optimizes text parsing for JSON strings containing full Unicode characters and certain escape sequences. The optimization allows the parser to return a direct byte reference to the underlying buffer instead of allocating new strings, extending this capability beyond simple ASCII strings.

Changes:

Extended the optimized path to support Unicode characters (UTF-8 encoded) and three specific escape sequences (\", \/, \\)
Modified the parser logic to track string length separately from byte length to correctly handle multi-byte UTF-8 characters
Added escape sequence handling that removes backslashes from supported sequences while maintaining buffer references

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
ESUTF8StreamJsonParser.java	Implemented optimized text parsing for Unicode and select escape sequences by tracking character length separately, handling UTF-8 multi-byte sequences, and copying buffers with backslash removal
ESUTF8StreamJsonParserTests.java	Added test cases for Unicode strings and escape sequences, replaced test helper with more comprehensive randomized input generation supporting various character types and escape sequences

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-31T00:38:49Z

...tent/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java

    protected int stringEnd = -1;
+    protected int stringLength;
+
+    private final List<Integer> backslashes = new ArrayList<>();


Using an ArrayList as an instance field that's cleared and reused could lead to memory retention if large strings with many backslashes are parsed. Consider using a smaller initial capacity or switching to a primitive int array with manual size tracking to reduce allocation overhead and memory footprint for the common case of few or no backslashes.

Suggested change

private final List<Integer> backslashes = new ArrayList<>();

private final List<Integer> backslashes = new ArrayList<>(4);

Copilot · 2026-01-31T00:38:49Z

...impl/src/test/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParserTests.java

+
+    private int randomCodepoint(boolean includeAscii) {
+        while (true) {
+            char val = Character.toChars(randomInt(0xFFFF))[0];


Character.toChars() returns an array that could contain two elements for supplementary characters (codepoints > 0xFFFF), but the code only takes the first element. Since randomInt(0xFFFF) limits values to the BMP, this works, but the method name 'randomCodepoint' suggests it should support all codepoints. Either rename the method to clarify it only handles BMP characters, or extend the range and handle supplementary characters properly.

Suggested change

char val = Character.toChars(randomInt(0xFFFF))[0];

char val = (char) randomInt(0xFFFF);

Copilot · 2026-01-31T00:38:49Z

...impl/src/test/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParserTests.java

+        boolean doesSupportOptimized = true;
+        for (int i = 0; i < length; ++i) {
+            if (forceSupportOptimized == false && randomBoolean()) {
+                switch (randomInt(9)) {


The magic number 9 in the switch statement is unclear. This should be documented or replaced with a named constant that explains why 10 possible values (0-9) are used, where case 0 handles escape sequences, case 1 handles unicode escapes, and default handles unicode characters. The probability distribution (1/10 for escapes, 1/10 for unicode escapes, 8/10 for unicode chars) appears intentional but is not obvious.

MitchLewis930 requested a review from Copilot January 31, 2026 00:38

Copilot AI reviewed Jan 31, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Optimized text for full unicode and some escape sequences (#129169)#7

Optimized text for full unicode and some escape sequences (#129169)#7
MitchLewis930 wants to merge 1 commit intopr_017_beforefrom
pr_017_after

MitchLewis930 commented Jan 31, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 31, 2026

Uh oh!

Copilot AI Jan 31, 2026

Uh oh!

Copilot AI Jan 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	private final List<Integer> backslashes = new ArrayList<>();
	private final List<Integer> backslashes = new ArrayList<>(4);

	char val = Character.toChars(randomInt(0xFFFF))[0];
	char val = (char) randomInt(0xFFFF);

Comments

Conversation

MitchLewis930 commented Jan 31, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 31, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants